Abstract

Modern high-throughput gene perturbation screens are key technologies at the forefront of genetic
research. Combined with rich phenotypic descriptors they enable researchers to observe detailed cellular
reactions to experimental perturbations on a genome-wide scale. This review surveys the current state
of the art in analyzing single gene perturbation screens from a network point of view. We describe
approaches to make the step from the parts list to the wiring diagram by using phenotypes for network 
inference and integrating them with complementary data sources. The first part of the review describes 
methods to analyze one- or low-dimensional phenotypes like viability or reporter activity; the second part 
concentrates on high-dimensional phenotypes showing global changes in cell morphology, transcriptome 
or proteome.

Introduction 

Functional genomics has demonstrated considerable success in inferring the inner workings of a cell through
analysis of its response to various perturbations. In recent years several technological advances have 
pushed gene perturbation screens to the forefront of functional genomics. Most importantly, modern 
technologies make it possible to probe gene function on a genome-wide scale in many model organisms 
and in humans. For example, large collections of knock-out mutants play a prominent role in the study of
S. cerevisiae [1] and RNA interference (RNAi) has become a widely used high-throughput method to
knock down target genes in a wide range of organisms, including Drosophila melanogaster, C. elegans,
and humans [2-4].

Another major advance is the development of rich phenotypic descriptions by imaging or measuring 
molecular features globally. Observed phenotypes can reveal which genes are essential for an organism, or 
work in a particular pathway, or have a specific cellular function. Combining high-throughput screening 
techniques with rich phenotypes enables researchers to observe detailed reactions to experimental
perturbations on a genome-wide scale. This makes gene perturbation screens one of the most promising tools
in functional genomics. 

Advances in the design and analysis of gene perturbation screens may have an immediate impact on 
many areas of biological and medical research. New screening and phenotyping techniques often directly 
translate into new insights in gene and protein functions. Results of perturbation screens can also reveal 
unexploited areas of potential therapeutic intervention. For example, a recent RNAi screen showed that
some of the most critical protein kinases for the proliferation and survival of cancer cell lines are also the 
least studied [5]. 

A goal becoming more and more prominent in both experimental and computational research is
to leverage gene perturbation screens for the identification of molecular interactions, cellular pathways and
regulatory mechanisms. Research focus is shifting from understanding the phenotypes of single proteins 
to understanding how proteins fulfill their function, what other proteins they interact with and where 
they act in a pathway. Novel ideas on how to use perturbation screens to uncover cellular wiring diagrams 
can lead to a better understanding of how cellular networks are de-regulated in diseases like cancer. This 
knowledge is indispensable for finding new drug targets to attack the drivers of a disease and not only 
the symptoms. 
Phenotypes. A phenotype can be any observable characteristic of an organism. Analysis strategies
strongly depend on how rich and informative phenotype descriptors are. We will call phenotypes resulting
from a single reporter (or a small number of reporters) low-dimensional phenotypes and the genes showing
significant results hits [6,7]. Examples of such low-dimensional phenotypes are cell viability versus cell
death [1], growth rates [8] or the activity of reporter constructs, e.g. a luciferase, downstream of a pathway
of interest [9]. Low-dimensional phenotyping screens can identify candidate genes on a genome-wide scale
and are often used as a first step for follow-up analysis. We will discuss methods to functionally interpret 
hits from low-dimensional phenotyping screens and to place them in the context of cellular networks in 
the first part of this review. 

The second part will be devoted to high-dimensional phenotyping screens, which evaluate a large
number of cellular features at the same time. Observing system-wide changes promises key insights into
cellular mechanisms and pathways that cannot be supplied by low-dimensional screens. For example,
high-dimensional phenotypes can include changes in cell morphology [10-13], or growth rates under a
wide range of conditions [14], or transcriptional changes measured on microarrays [15-18], or changes in
the metabolome and proteome [19] measured by mass spectrometry [20] or flow cytometry [21,22].
Morphological and growth phenotypes can be obtained on a genome-wide scale [13,14], while transcriptional
and proteomic phenotypes are often restricted to individual pathways or processes [16,17,21].

The distinction between low- and high-dimensional phenotypes may sound technical, but it is crucial 
for choosing potential analysis methods. The central difference is that high-dimensional phenotypes 
allow computing correlations and other similarity measures, which are not applicable to low-dimensional
phenotypes. Another important distinction is between static phenotypes, providing a 'snapshot' of a cell's 
reaction to a gene perturbation, and dynamic phenotypes showing a cell's reaction over time. We expect 
more and more studies in the future to produce dynamic output and in the following we note explicitly which
methods can be applied to dynamic phenotypes. For the biological interpretation of screening results it is 
very important to keep in mind which level of 'cellular granularity' a phenotype describes: growth rates 
or cell morphologies are much more 'high-level' features of the cell than gene or protein expressions. As 
soon as more studies produce dynamic phenotypes on many different cellular levels, integrative analysis of 
inter-connected phenotypes [23] will become more important. In the following, however, we concentrate 
on the current state of the art, which almost always uses a single type of readout in a perturbation screen.

Pre-processing pipeline. In this review we focus on single gene perturbations by knockouts [1] or RNA
interference [4] that allow targeting individual genes or combinations of genes. Before network analysis,
the raw data needs to pass an initial analysis and quality control pipeline specific to the perturbation and
phenotyping technologies used. Low-dimensional screens are mostly performed in multiple-well plates
and a typical analysis pipeline [4] includes data pre-processing, removal of spatial biases per plate,
normalization between plates, and finally detection of significant hits [6,7,24]. In vertebrates, genes need
to be targeted with multiple siRNAs to ensure effective down-regulation [4] and the multiple phenotypes
per gene can afterwards be integrated into a statistical score [25]. High-dimensional morphological screens
depend on computational analysis like image segmentation [26,27] and phenotype discovery [28-30] for
rapid and consistent phenotyping. Molecular high-dimensional phenotypes need pre-processing depending
on their platform and different approaches exist, e.g. for flow-cytometry data [31] or microarrays [32].
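
The normalization and hit-detection steps for a low-dimensional screen can be illustrated with a minimal sketch. It assumes one particular choice, a per-plate robust z-score based on the median and the median absolute deviation with a fixed threshold; the packages cited above offer several alternatives. The gene names, readout values and the threshold of 3 are hypothetical.

```python
from statistics import median

def robust_z_scores(plate):
    """Robust z-scores for one plate: (x - median) / (1.4826 * MAD)."""
    values = list(plate.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        raise ValueError("plate has no spread; cannot normalize")
    return {gene: (v - center) / scale for gene, v in plate.items()}

def call_hits(plates, threshold=3.0):
    """Normalize each plate separately, then flag genes with |z| >= threshold."""
    hits = {}
    for plate in plates:
        for gene, z in robust_z_scores(plate).items():
            if abs(z) >= threshold:
                hits[gene] = z
    return hits

# Hypothetical raw reporter readouts for two plates with different baselines
plate_1 = {"geneA": 100.0, "geneB": 104.0, "geneC": 96.0, "geneD": 98.0, "geneE": 20.0}
plate_2 = {"geneF": 210.0, "geneG": 190.0, "geneH": 200.0, "geneI": 205.0, "geneJ": 195.0}
hits = call_hits([plate_1, plate_2])
print(sorted(hits))
```

Normalizing each plate on its own removes the difference in baseline intensity between the two plates, so only the strong outlier on the first plate is reported as a hit.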

From phenotypes to cellular networks. The phenotypes we have discussed above allow only an
indirect view of how different genes in the same process interact to achieve a particular phenotype.
Cell morphology or sensitivity to stresses, for example, are global features of the cell and hard to relate
directly to how individual genes contribute to them (see Fig. 1a). Gene expression phenotypes show
transcriptional changes in the genes downstream of a perturbed pathway but offer only an indirect
view of pathway structure due to the high number of non-transcriptional regulatory events like protein
modifications [33]. For example, different protein activation states caused by phosphorylation may not be visible
as changes in mRNA concentrations (see Fig. 1b).

This gap between observed phenotypes and underlying cellular networks is the main problem in the 
analysis of perturbation screens and applies to both low- and high-dimensional screens. The goal of 
computational analysis is to bridge this gap by inferring gene function and recovering pathways and 
mechanisms from observed phenotypes. The following methods address the challenge in different ways,
mostly by integrating the perturbation effects and phenotypes with additional sources of information like 
collections of functionally related gene sets or protein-interaction networks. 

Network analysis of low-dimensional phenotypes 

Global overview by enrichment analysis. A simple way to link phenotypes to gene function is
to test whether pathways or functional groups of genes (e.g. defined by Gene Ontology terms [34] or 
MSigDB [35]) are enriched in the list of hits. Most methods use a hypergeometric test statistic (see 
Fig. 2a) and many can be used online [36-38] or as Bioconductor packages [39]. An alternative global 
functional annotation method tests whether functional groups show a trend towards especially strong or 
weak phenotypes without using a cutoff to define hits [35] (see Fig. 2b). Enrichment analysis can also 
be very useful to analyze high-dimensional phenotypes, for example when functionally annotating the 
results of a clustering method. 
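
The cutoff-based test can be written down in a few lines. The sketch below computes the one-sided hypergeometric tail probability that a gene set overlaps the hit list at least as much as observed; the function names and the toy gene identifiers are illustrative assumptions, not taken from any of the cited tools.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_set, n_hits, n_overlap):
    """P(overlap >= n_overlap) when n_hits genes are drawn without replacement
    from n_universe genes, of which n_set belong to the gene set."""
    total = comb(n_universe, n_hits)
    upper = min(n_set, n_hits)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def enrichment(universe, gene_set, hits):
    """Enrichment p-value of gene_set among hits, restricted to screened genes."""
    universe = set(universe)
    gene_set = set(gene_set) & universe
    hits = set(hits) & universe
    return hypergeom_enrichment_p(
        len(universe), len(gene_set), len(hits), len(gene_set & hits)
    )

# Toy example: 20 screened genes, a 5-gene pathway, 4 hits of which 3 are in the pathway
universe = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
hit_list = ["g0", "g1", "g2", "g10"]
p_value = enrichment(universe, pathway, hit_list)
print(round(p_value, 4))
```

In a real analysis the test is repeated for every gene set, so the resulting p-values need a multiple-testing correction, which this sketch omits.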

Enrichment analysis results in a list of p-values describing how significantly each gene set is
enriched among the hits. It reduces complexity and improves interpretability of results by
moving from single genes to functionally related gene sets. This type of analysis is often called 'un-biased'
and 'hypothesis-free' and is ideal for a comprehensive first overview. However, enrichment analysis loses
its value for complexity reduction if the number of gene sets becomes too large. Also, overlap and
dependencies between gene lists that could potentially bias the results have so far only been addressed for the
GO graph [38,39] but not for more general collections of gene lists like MSigDB [35].

Good data analysis asks specific questions. A hypothesis-free method can only be the very first 
starting point for a deeper exploration of the data. For example, all enrichment methods rely on known 
gene sets and cannot uncover new pathways or components. Enrichment methods treat pathways as bags 
of unconnected genes without considering connections within and between pathways. Thus, enrichment 
methods can only deliver a very crude picture of the cell. In the following we will discuss approaches 
to overcome some of the limitations of enrichment analysis by integrating the observed phenotypes with 
complementary sources of information. 

Mapping phenotypes to networks. Further valuable sources of information for interpreting RNAi hits
are gene and protein networks obtained either experimentally [40,41] or computationally by literature
mining [42] or integrating heterogeneous genomic data [43-45]. All computational networks are available
online on supplementary webpages and the experimental networks can be obtained from databases like 
STRING [46] or BioGRID [47]. 

Using these complementary data sources can improve hit identification [48-50] and even provide a 
more refined view of the pathways the hits contribute to. One strategy is to search for sub-networks 
containing a surprisingly large number of hits (see Fig. 3a). While this strategy is already useful when 
evaluating interesting sub-networks by eye [51,52], its true power comes from the availability of efficient
search algorithms to find sub-networks enriched for RNAi hits and assess their significance [53-57]. An 
additional application of mapping hits to a network is that known phenotypes can be used to predict 
phenotypes of genes not included in the screen, e.g. by assuming that a gene connected to many hits 
should also show a strong phenotype [51]. The success of all network-mapping strategies strongly depends 
on the quality and coverage of both the screen and the linkage in the network. 
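
The idea that a gene connected to many hits is itself a likely candidate can be sketched with a simple neighbor count. This is only an illustration of the principle under the assumption of an undirected, unweighted network; it is not the scoring scheme of [51] or of the sub-network search algorithms [53-57], and the edge list and hit set are hypothetical.

```python
def neighbor_hit_fraction(edges, hits):
    """For every gene in the network, return the fraction of its neighbors that are hits."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hits = set(hits)
    return {gene: len(nbrs & hits) / len(nbrs) for gene, nbrs in neighbors.items()}

# Hypothetical background network and screening hits
edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"),
         ("g3", "g4"), ("g4", "g5"), ("g5", "g6")]
hits = {"g1", "g2", "g4"}

scores = neighbor_hit_fraction(edges, hits)
# Rank genes that were not hits (or not screened) by how hit-rich their neighborhood is
candidates = sorted((g for g in scores if g not in hits), key=lambda g: -scores[g])
print(candidates)
```

Here g3, all of whose neighbors are hits, ranks first. The ranking is only as reliable as the network: a missing or spurious edge changes the score directly.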
Gene prioritization. Other approaches complement genomic data with biological prior knowledge
showing what 'interesting' hits look like. Gene prioritization [49,58] ranks genes according to how promising
they would be for follow-up studies. Because it uses prior knowledge to fine-tune the algorithm, gene 
prioritization can be more focussed than a global un-informed search for enriched subnetworks. 

Network analysis of high-dimensional phenotypes 

Global overview by clustering and ranking. Most state-of-the-art analysis techniques rely on a 
guilt-by-association paradigm: genes with similar phenotypes will most probably have a similar biological 
function. This explains the prevalence of clustering techniques in analyzing high-dimensional phenotyping 
screens [10,13,14,17]. Clustering is a convenient first analysis and visualization step that can highlight
strong trends and patterns in the data and can thus yield a global first impression of functional units. 
Another analysis strategy relying on guilt-by-association is to rank genes by their phenotypic similarity 
compared to a gene of interest [11]. Clustering and ranking can be combined with enrichment analysis 
(as discussed above) for functional interpretation. 
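
Ranking by phenotypic similarity can be sketched as follows. The example assumes Pearson correlation as the similarity measure, which is one common choice but not necessarily the one used in [11]; the profiles are hypothetical four-feature vectors.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def rank_by_similarity(profiles, query):
    """Rank all other genes by the correlation of their profile with the query profile."""
    ref = profiles[query]
    scored = [(gene, pearson(ref, p)) for gene, p in profiles.items() if gene != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical high-dimensional phenotypes (four features per perturbed gene)
profiles = {
    "query": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_by_similarity(profiles, "query")
print([gene for gene, _ in ranking])
```

Under the guilt-by-association assumption, the genes at the top of the ranking are candidates for sharing a function with the query gene.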

Graph methods linking causes to effects. Another useful data visualization, especially for
transcriptional phenotypes, is to build a directed (not necessarily acyclic) graph by drawing an arrow between two
genes if perturbing one results in a significant expression change at the other [59]. This graph
representation can then be used as a starting point for further analysis, for example by using graph-theoretic methods
of transitive reduction [60] to distinguish between direct and indirect effects of a perturbation [61,62].
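
A minimal sketch of transitive reduction is given below. It assumes that the perturbation graph is acyclic, in which case the reduction is unique; the general, possibly cyclic case treated in the cited work needs additional care. Gene names are hypothetical.

```python
def transitive_reduction(edges):
    """Remove every edge u -> v for which another path from u to v exists.
    Assumes the directed graph is acyclic."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable(start, target, skip_edge):
        # Depth-first search from start to target that ignores skip_edge
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip_edge:
                    continue
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {(u, v) for u, v in edges if not reachable(u, v, (u, v))}

# Perturbing A changes B and C; perturbing B changes C
observed = [("A", "B"), ("B", "C"), ("A", "C")]
direct = transitive_reduction(observed)
print(sorted(direct))
```

The edge from A to C is removed because it can be explained by the path through B. Whether the removed edge is truly indirect cannot be decided from the graph alone, so the result is a parsimonious hypothesis rather than a proof.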

Probabilistic graphical models. Most approaches to infer pathway structure from experimental 
data rely on probabilistic graphical models. For low-dimensional phenotypes they often suffer from
non-uniqueness and un-identifiability issues [63], but can be applied very successfully in high-dimensional
settings. Prominent examples are (static or dynamic) Bayesian networks, which describe probabilistically
how a gene is controlled by its regulators [64,65]. To model experimental perturbations most approaches
rely on the concept of 'ideal interventions' [66] which deterministically fix a target gene to a particular
state (e.g. '0' for a gene knockout). Ideal interventions were applied in Bayesian networks [21,67,68],
factor graphs [69] and dependency networks [70]. In simulations [71,72] and on real data [21,71] it was 
found that interventions are critical for effective inference. 
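
Why interventions matter can be seen in a toy Bayesian network. The sketch below assumes three binary genes in which A regulates both B and C, while B has no effect on C; all probabilities are invented for illustration. Observing B = 0 changes the belief about C through the common regulator A, whereas clamping B to 0 by an ideal intervention does not.

```python
from itertools import product

# Probability of state 1 for each gene given the states of its parents (hypothetical values)
parents = {"A": (), "B": ("A",), "C": ("A",)}
p_one = {
    "A": {(): 0.5},
    "B": {(0,): 0.1, (1,): 0.9},
    "C": {(0,): 0.2, (1,): 0.8},
}

def joint(state, do=None):
    """Probability of a full state; genes listed in `do` are clamped (ideal intervention)."""
    do = do or {}
    prob = 1.0
    for gene, value in state.items():
        if gene in do:
            # The gene's own regulation is replaced by a fixed value
            prob *= 1.0 if value == do[gene] else 0.0
            continue
        p1 = p_one[gene][tuple(state[p] for p in parents[gene])]
        prob *= p1 if value == 1 else 1.0 - p1
    return prob

def prob_c_given_b0(do=None):
    """P(C = 1 | B = 0), computed by enumerating all states."""
    genes = list(parents)
    num = den = 0.0
    for values in product((0, 1), repeat=len(genes)):
        state = dict(zip(genes, values))
        if state["B"] != 0:
            continue
        p = joint(state, do)
        den += p
        if state["C"] == 1:
            num += p
    return num / den

observational = prob_c_given_b0()              # B = 0 merely observed
interventional = prob_c_given_b0(do={"B": 0})  # B knocked out
print(round(observational, 2), round(interventional, 2))
```

The observational value differs from the marginal probability of C although B does not regulate C, while the interventional value equals it. Purely observational data would therefore suggest a dependency that the knockout correctly rules out.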

The model of ideal interventions contains a number of idealizations (hence the name), most
importantly that manipulations only affect single genes and that perturbation strength can be controlled
deterministically. The first assumption may not be true if there are off-target or compensatory effects involving
other genes. The second assumption may also not hold true in realistic biological scenarios; in particular 
for RNAi screens where experimentalists often lack knowledge about the exact knock-down efficiency. 
Probabilistic generalizations of ideal interventions can be used to cope with this uncertainty [73]. 

Probabilistic data integration. High-dimensional phenotypic profiles can be mapped to given graphs
and networks by finding subgraphs that are connected in the background network and at the same time 
show high similarity of phenotypic profiles. These approaches already exist for mapping gene expression 
data onto protein interaction networks [74] and the same algorithms could easily be applied to any other 
kind of high-dimensional phenotypic profiles (see Fig. 3b). Other approaches use data integration to 
construct potential pathways from protein interactions and transcription factor binding data to relate 
perturbed genes to the observed downstream effects [75-77].

Multiple Input - Multiple Output (MIMO) models. Many of the approaches discussed so far,
like clustering or graphical models, can be applied to both static 'snapshots' and dynamic time-course
measurements. Another approach to model specifically the dynamics of networks comes from a
branch of control theory called 'system identification' [78] and uses so-called Multiple Input - Multiple
Output (MIMO) models. MIMO models represent the evolution of a perturbed cell over time by linear
differential equations [79-83] and can represent non-linear effects by transfer functions [84]. The models
can be inferred by regression techniques in the linear case [80] or Monte Carlo stochastic search in the
non-linear case [84]. The framework is very flexible and can incorporate single as well as combinatorial
perturbations.
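
The linear case can be illustrated by simulating a small system forward in time. The sketch assumes the generic state-space form dx/dt = A x + B u, with x the molecular states and u the perturbation inputs, integrated by the forward Euler method; the matrices are invented, and inferring A and B from data, which is the actual modeling task, is not shown.

```python
def simulate_linear_mimo(A, B, u, x0, dt=0.01, steps=1000):
    """Forward-Euler integration of dx/dt = A x + B u with constant input u."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        dx = [
            sum(A[i][j] * x[j] for j in range(n))
            + sum(B[i][k] * u[k] for k in range(len(u)))
            for i in range(n)
        ]
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x

# Hypothetical two-gene cascade: the input drives gene 1, gene 1 activates gene 2,
# and both decay at unit rate
A = [[-1.0, 0.0],
     [1.0, -1.0]]
B = [[1.0],
     [0.0]]
final_state = simulate_linear_mimo(A, B, u=[1.0], x0=[0.0, 0.0], dt=0.01, steps=2000)
print([round(v, 3) for v in final_state])
```

With a constant input both genes approach a steady state in which production and decay balance. Combinatorial perturbations correspond to input vectors u with more than one non-zero entry.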

Nested Effects Models (NEMs). One of the key problems in analyzing perturbation screens is that
the observed phenotypes are downstream of the perturbed pathway and may not show the direct influence
of one pathway component on another. Nested Effects Models [33,85] are a class of models that explicitly
address this problem. They reconstruct pathway structure from subset relations based on the following
rationale: Perturbing some genes may have an influence on a global process, while perturbing others 
affects sub-processes of it. Imagine, for example, a signaling pathway activating several transcription 
factors. Blocking the entire pathway will most probably affect all targets of all transcription factors, 
while perturbing a single transcription factor will only affect its direct targets, which are a subset of the 
phenotype obtained by blocking the complete pathway. Given high-dimensional phenotypes showing a
subset structure, NEMs find the most likely pathway topology explaining the data. They differ from other 
statistical approaches like Bayesian networks by encoding subset relations instead of correlations or other 
similarity measures. The theory of NEMs has been applied and extended in several studies [86-89]. An 
implementation is available as an R/Bioconductor package [90]. Other extensions to the NEM framework 
distinguish between activating and inhibiting regulation [91] or include dynamic information from
time-series measurements [92].
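
The subset rationale can be made concrete with a noise-free caricature. The sketch below simply draws an edge from gene a to gene b whenever the effects of perturbing b are a proper subset of the effects of perturbing a. Actual NEMs score candidate topologies probabilistically to cope with noisy data, so this is an illustration of the underlying idea and not of the algorithm in the cited package; gene and target names are hypothetical.

```python
def subset_graph(effects):
    """Draw an edge a -> b when the effects of perturbing b are a proper subset
    of the effects of perturbing a (noise-free data assumed)."""
    return {
        (a, b)
        for a in effects
        for b in effects
        if a != b and effects[b] < effects[a]
    }

# Hypothetical sets of downstream genes that change after each perturbation
effects = {
    "pathway_kinase": {"t1", "t2", "t3", "t4"},
    "TF1": {"t1", "t2"},
    "TF2": {"t3", "t4"},
}
hierarchy = subset_graph(effects)
print(sorted(hierarchy))
```

The kinase is placed upstream of both transcription factors, while the two factors are not ordered relative to each other because neither effect set contains the other.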

Discussion and Outlook 

In this review we have discussed two main approaches to describe the reaction of a cell to an experimental 
gene perturbation: low-dimensional phenotypes measure individual reporters for cell viability or pathway 
activation, while high-dimensional phenotypes show global effects on cell morphology, transcriptome or 
proteome. Table 1 lists examples of freely available software implementing some of these approaches. 
All of them can be directly applied to gene perturbation screens, even though some of them have been 
introduced in different contexts. While this review has focused on single gene knock-outs and
knock-downs, similar approaches can be applied to gene over-expression screens [22,83,93,94], drug treatment
[84], environmental stresses changing many genes [95,96] or even natural genetic variation [97]. 

Predicting phenotypes from metabolic networks. The focus of this review is on functionally
annotating hits in a network context and reconstructing networks from high-dimensional phenotypes. In a 
complementary direction of research, genome-wide reconstructions of metabolic networks [98,99] are used 
to predict effects of gene perturbations. Instead of predicting networks from phenotypes, these approaches 
predict phenotypes from networks. For example, in S. cerevisiae and E. coli computational models very 
accurately predict fitness effects of gene knock-outs [100,101] as well as compensatory rescue effects [102]. 
However, recent developments in metabolic network modeling have led to linear programming algorithms 
to extract relevant context-specific sub-networks of activity from a genome-wide network [103,104]. In
the same way as the probabilistic data integration methods discussed above, e.g. [74], these algorithms 
could be used in the future to find metabolic sub-networks active under certain gene perturbations. 

From single to combinatorial perturbations. While single gene perturbation screens have been
immensely successful in extending our knowledge of pathway components and interactions, an important
limitation can be caused by compensatory effects, genetic buffering and redundancy of cellular mechanisms
and pathways [105,106]. This can only be overcome by perturbing several genes at the same time.
The number of possible combinations grows rapidly and thus current approaches are mainly limited to
perturbing pairs of genes and observing low-dimensional phenotypes like fitness estimates [107]. The 
analysis of combinatorial perturbations is the topic of another review [108]. 

The end of the screen is the beginning of the experiment. Global phenotyping and pathway
screening can be combined in the same study. For example, a first genome-wide screen identifies key 
genes representative for pathways and cellular mechanisms involved in the phenotype. In a second step 
the hits of the first screen could be assayed for high-dimensional molecular phenotypes to infer a pathway 
diagram using Nested Effects Models or other statistical approaches. 

In a further step these preliminary pathway models could be used to plan an additional round of
experimentation. Different modeling frameworks propose future experiments to most effectively refine a 
pathway hypothesis, e.g. Bayesian networks [109,110], physical network models [76], logical models [111], 
Boolean networks [112], and dynamical modeling [79]. 

Iteratively integrating experimentation and computation may lead to a virtuous circle and is one of
the most promising approaches to refine our understanding of the inner workings of the cell.

Figures and legends

Figure 1. Cellular networks underlying observable phenotypes. (a) Phenotypes are the
response of the cell to external signals mediated by cellular networks and pathways. The goal of 
computation is to reconstruct these networks from the observed phenotypes. (b) Global molecular 
phenotypes like gene expression allow a view inside the cell but also have limitations. This is 
exemplified here in a cartoon pathway adapted from [61] showing a cascade of five genes/proteins (A-E). 
Proteins A-C form a kinase cascade, D is a transcription factor acting on E. Up-regulation of A starts 
information flow in the cascade and results in E being turned on. In gene expression data this is visible 
as a correlation between A and E (represented as an undirected edge in the model). Experimentally 
perturbing a gene, say C, removes the corresponding protein from the cascade, breaks the information
flow and results in an expression change at E (represented as an arrow in the model). However, the 
different phosphorylation and activation states of proteins B-D will most probably not be visible as 
changes in gene expression. Thus, due to the pathway mostly acting on the protein level most parts of 
the cascade (dashed arrows in the model) cannot be inferred from gene expression data directly.

Figure 2. Functional annotation of hits by enrichment analysis. (a) In the first approach [38]
a cutoff is applied to select the hits with strongest phenotypes. A hyper-geometric test then evaluates if 
the overlap between the hits and a given gene set is surprisingly large (or small) compared to the 
overlap with a random set. (b) A second approach [35] does not need a cutoff. It maps the gene set 
(black bars) onto the observed phenotypes and quantifies if there is a significant trend or if the genes 
are spread out uniformly over the whole range. 

Figure 3. Extracting rich sub-networks. Different patterns in the graph point to a common
cellular mechanism causing a phenotype: (a) hits in a low-dimensional screen (red nodes) clustering in 
highly connected sub-networks, and (b) high correlation between high-dimensional phenotypes of target 
genes connected in the background network. The black graph represents any type of background 
network. 



Network analysis of gene perturbation screens 






Tables 

Table 1. Examples of software for network analysis of gene perturbation screens. 



| Category | Software | Description | Webpage |
|---|---|---|---|
| General data analysis and network visualization | Bioconductor | Software environment for the analysis of genomic data featuring hundreds of contributed packages [113] | www.bioconductor.org |
| General data analysis and network visualization | Cytoscape | Software platform for visualizing molecular interaction networks and integrating them with other data types [114] | www.cytoscape.org |
| Setting up data for network analysis | cellHTS2 | End-to-end analysis of cell-based screens: from raw intensity readings to the annotated hit list [6] | www.bioconductor.org |
| Setting up data for network analysis | RNAither | Analysis of cell-based RNAi screens, includes quality assessment and customizable normalization [7] | www.bioconductor.org |
| Setting up data for network analysis | EBImage | Cell image analysis and feature extraction [27] | www.bioconductor.org |
| Setting up data for network analysis | CellProfiler | Cell image analysis and feature extraction [26] | www.cellprofiler.org |
| Enrichment analysis | DAVID | Tools for data annotation, visualization and integration [36] | david.abcc.ncifcrf.gov |
| Enrichment analysis | GOLEM | Enrichment analysis and visualization of GO graph (Fig 2a) [37] | function.princeton.edu/GOLEM |
| Enrichment analysis | Ontologizer | Enrichment analysis with dependencies between GO nodes (Fig 2a) [38] | compbio.charite.de/ontologizer |
| Enrichment analysis | GSEA | Gene set enrichment analysis (Fig 2b) [35] | www.broadinstitute.org/gsea/ |
| Clustering and ranking | CellProfiler Analyst | Interactive exploration and analysis of multidimensional data from image-based experiments [28] | www.cellprofiler.org |
| Clustering and ranking | PhenoBlast | Ranking of phenotype profiles according to similarity with given profile [11] | www.rnai.org |
| Clustering and ranking | Endeavour | Prioritizes hits for further analysis [58] | www.esat.kuleuven.be/endeavour/ |
| Finding rich sub-networks | heinz | Finds optimal subnetworks rich in hits (Fig 3a) [55] | www.planet-lisa.net |
| Finding rich sub-networks | jActiveModules | Finds heuristic subnetworks rich in hits (Fig 3a) [53] | www.cytoscape.org |
| Finding rich sub-networks | Matisse | Finds subnetworks with high phenotypic similarity (Fig 3b) [74] | acgt.cs.tau.ac.il/matisse/ |
| Network reconstruction | nem | Nested Effects Models reconstruct pathway features from subset relations in high-dim phenotypes [90] | www.bioconductor.org |
| Network reconstruction | Copia | Copia uses MIMO models to reconstruct networks from perturbations [84] | cbio.mskcc.org/copia/ |


The table contains the name of the method, a short description with reference, and a webpage where it 
can be obtained. This list is far from comprehensive, but hopefully provides a starting point even for 
non-coding experimentalists. 



